Exploring facial expressions and action unit domains for Parkinson detection

Background and objective. Patients suffering from Parkinson’s disease (PD) present a reduction in facial movements called hypomimia. In this work, we propose to use machine learning facial expression analysis from face images based on action unit domains to improve PD detection. We propose different domain adaptation techniques to exploit the latest advances in automatic face analysis and face action unit detection.
Methods. Three different approaches are explored to model facial expressions of PD patients: (i) face analysis using single frame images and also using sequences of images, (ii) transfer learning from face analysis to action units recognition, and (iii) triplet-loss functions to improve the automatic classification between patients and healthy subjects.
Results. Experiments with real face images from PD patients show that it is possible to properly model elicited facial expressions using image sequences (neutral, onset-transition, apex, offset-transition, and neutral) with accuracy improvements of up to 5.5% (from 72.9% to 78.4%) with respect to single-image PD detection. We also show that our proposed action unit domain adaptation provides improvements of up to 8.9% (from 78.4% to 87.3%) with respect to face analysis. Finally, we also show that triplet-loss functions provide improvements of up to 3.6% (from 78.8% to 82.4%) with respect to action unit domain adaptation applied upon models created from scratch. The code of the experiments is available at https://github.com/luisf-gomez/Explorer-FE-AU-in-PD.
Conclusions. Domain adaptation via transfer learning methods seems to be a promising strategy to model hypomimia in PD patients. Considering the good results and also the fact that only up to five images per participant are considered in each sequence, we believe that this work is a step forward in the development of inexpensive computational systems suitable to model and quantify problems of PD patients in their facial expressions.

Introduction
[...] refers to the evaluation of the capability of PD patients to effectively recognize different expressions or emotions when looking at faces. Facial Expressivity Evaluation (FEE) refers to the capability of the patient to produce different facial expressions or emotions. Both aspects have a very important role in social interaction and non-verbal communication. The first one has been studied for several decades mainly by psychologists in different works and the main findings are summarized in a relatively recent study [15]. On the other hand, FEE has become a popular field among engineers and computer scientists, which opens space for research in different applications related to affective computing.
During the past two decades, the affective computing community has made great advances in developing novel technologies to model facial expressions and emotional information [16][17][18]. One of the goals of affective technologies is to create computational models with the ability to recognize, interpret, and process human emotions, making human-computer interaction more useful. Sentiment analysis and affective computing have been continuously studied since the 20th century, helping in the development of computer vision systems [19][20][21], in the creation of entertainment [22], and in the development of systems to aid different areas of medicine including neurology [23][24][25].
Our work is focused on the study of FEE in PD patients. The main aim is to consider videos collected from patients to evaluate their capability to elicit specific facial expressions and to compare such a capability with respect to healthy subjects using recent advances in Action Unit domains. This work presents three different approaches: (i) the face analysis domain which is based on single images and image sequences extracted from the participants' videos, (ii) the action unit domain which is created by applying transfer learning from the face analysis domain, and (iii) a specific analysis domain, focused on information from PD patients, that results from using the triplet loss function to improve the classification between PD patients and healthy subjects.
The rest of the paper is organized as follows: Related Works provides an overview about the literature on FEE. Contributions of this Work describes the contributions of this work in the topic of hypomimia modeling in Parkinson's disease. Materials and Methods presents the experimental framework, including the description of the datasets and the methods. Experiments and Results summarizes the experiments and results. Finally, the discussion, conclusions and future work are drawn in Discussion and Conclusion.

Related works
One of the earliest studies about FEE in PD patients was conducted in 2004 by Simons et al. [26]. The authors evaluated the capability of 19 PD patients and 25 healthy subjects to pose and imitate different facial expressions. Videos with social interactions were used to evoke emotional responses in the patients' faces. The videos were manually analyzed and the participants' expressiveness was rated according to subjective rating scales, objective facial measurements, and self-questionnaires. The objective measurement was based on the facial action coding system presented in [27], where the facial expression is decomposed according to specific facial muscle movements like raising the eyebrows and wrinkling the nose. The results of the study indicated that patients with PD have reduced capability to produce spontaneous facial expressions in all experimental situations. Two years later in [28], the authors presented a work where expressivity and bradykinesia were studied. The authors hypothesized that intentional facial expressions are slowed (bradykinetic) and involve less movement in PD patients than in healthy controls. This hypothesis was inspired by other intentional movements performed by PD patients, e.g., walking, where bradykinesia is also observed. Digitized videos were evaluated frame-by-frame and the entropy in temporal changes of pixel intensity was measured [29]. The authors found that PD patients had reduced entropy compared to healthy controls, and were significantly slower in reaching a peak expression (p < 0.0001), which is directly associated with bradykinesia.
In 2016 Almutiry et al. [30] presented perhaps the only longitudinal study about FEE in PD patients. A total of 8 subjects (4 PD and 4 healthy controls) participated in the study. Patients were recorded for five days per week (once per day) during six weeks while controls were recorded for five days within one week. Participants were requested to produce specific facial expressions while being recorded. The authors used two classical feature extraction methods to localise 27 facial features: Active Appearance Model (AAM) and Constrained Local Model (CLM). The results suggested that PD patients exhibit less movement than controls, which confirms the observations made ten years earlier by Bowers et al. [28].
In 2017, Gunnery et al. [31] studied the coordination of movements across regions of the face in 8 PD patients (4 female). They used the facial action coding system [27,32] to measure spontaneous facial expressions. The number of activated frames per action unit and their intensity were manually labeled. Correlations were computed for activation values obtained across different regions of the face. The results showed that as the severity of the facial expression deficit increased, there was a decrease in the number, duration, intensity, and co-activation of facial muscle actions. In the same year, Bandini et al. [14] classified emotions expressed by 17 PD patients (13 male) and an equal number of healthy controls (6 male). Different emotions were evaluated including happiness, anger, disgust, and sadness. Different areas of the face were modeled with 49 landmarks [33,34], including eyes, eyebrows, mouth, and nose. A total of 20 features were extracted to define a linear combination of specific reference points. Acted and imitated facial expressions were considered. A support vector machine (SVM) was trained to automatically detect different emotions expressed by participants. The results with imitated expressions showed higher accuracies for healthy controls in most of the emotions. The only case where the PD patients displayed an expression better than the healthy subjects was sadness. When acted expressions were evaluated, the authors also found higher accuracies for healthy subjects than for PD patients.
Other contributions in the topic of FEE in PD include the study of Kang et al. [35]. The authors evaluated whether deficiencies in the orofacial movements of PD patients occur in spontaneous and voluntary expressions. Muscular activation (related to specific regions in the face) was studied considering electromyography signals. Data from the East Asian Dynamic Facial Expression Stimuli (EADFES) database was used [36]. A group with 20 PD patients and 20 healthy controls was evaluated; the authors report limitations of patients to express emotions spontaneously, although the observed dynamics in the movement of the face are similar across all subjects. The study also highlighted the deterioration in the patient's quality of life due to the presence of "masked face", affecting social and psychological aspects and increasing their risk of developing depression-related symptoms. The study presented in [15] suggested that PD patients present a deficit in emotion expressivity. According to the results obtained in [15], the deficit seems to be greater for the basic negative emotions (sadness and anger). The basic negative emotions are associated with the following Face Action Units: sadness: 1, 4, 6, 11, 15, 17 and anger: 4, 5, 7, 10, 17, 22-26.
More recently, in another line of work, Grammatikopoulou et al. [37] analyzed facial expressions from images captured with smartphones. Geometric features of the face were extracted and stored in the cloud. A total of 34 participants were recruited, 23 with PD and 11 healthy controls. Patients were divided into three groups according to the facial expression score of the MDS-UPDRS-III scale. The authors extracted two feature sets: one using the Google Face API and the other using the Microsoft Face API [38]. The feature sets were composed of reference points on the faces; two linear regression models were then developed (one per feature set) to estimate two different values of the Hypomimia Severity index, namely HSi1 and HSi2. These two indexes were used to classify between Parkinson's patients and healthy people. The reported sensitivity and specificity values were 0.79 and 0.82 for HSi1, and 0.89 and 0.73 for HSi2, respectively. Other contributions include Ali et al. [39]; the authors used OpenFace to evaluate the variance in the action unit predictions in the PARK dataset [40]. The dataset contains 604 subjects, with 61 PD patients and 543 healthy controls evoking three different expressions. They analyzed three Action Units per expression and used an SVM to classify between PD patients and healthy subjects. The reported accuracy, precision, and recall were 95.6, 95.8, and 94.3, respectively.
In other works, Rajnoha et al. [41] used a face analysis convolutional neural network to extract features from 100 subjects (50 PD and 50 healthy controls) and then used traditional classifiers such as K-nearest neighbors, XGBoost, decision trees, random forest, and SVM to classify PD patients. The best reported accuracy was 67.33, obtained with decision trees. Furthermore, Gomez et al. [42] presented a multimodal study based on static and dynamic features for Parkinson's detection in 4 facial gestures. A total of 17 dynamic features are extracted from a linear combination in an automatic facial mesh [43], and 2048 features are obtained from the maximum peak of the facial gesture. The experiments were carried out on the FacePark-GITA database, which includes 54 participants, 30 with PD and 24 healthy controls. They reported accuracies of 77.36 and 71.15 using only static and only dynamic features, respectively, and accuracies of up to 88.76 when both approaches were combined. Additionally, in 2020 Sonawane and Sharma [44] presented a review of automatic techniques and the use of machine learning in detecting emotional facial expressions in PD patients. The authors show that the use of deep learning in this field has not been adequately addressed yet in the classification between healthy people and PD patients. They also conducted a pilot experiment based on a CNN trained from scratch for masked face detection. In the same year, Jin et al. [45] presented a traditional classifier and recurrent neural networks (RNN) with features based on 106 facial landmarks obtained using Face++. The feature extraction considers the amplitude and tremor of different facial landmarks. The authors evaluated a group with 33 PD patients and 31 healthy controls and reported precision and recall of 0.93 for traditional classifiers and 0.86 for a Long Short-Term Memory (LSTM) classifier. The experiments described in [44,45] show that deep learning-based models can be helpful for classification. Finally, to provide an overall picture, we present Table 1, which contains several machine learning studies related to Parkinson's disease.

Contributions of this work
As shown in the literature review, there is a lack of work in the field of FEE for modeling hypomimia in PD patients with the latest affective models, including deep learning techniques. One of the main reasons for this lack of deep approaches is the absence of large scale databases with PD patients. In contrast, the Face Analysis and Affective Computing research communities have made great efforts to release databases with millions of samples. In this work, we propose to use facial expression analysis and Action Unit domains to improve PD detection. We propose different domain adaptation techniques [46,47] to exploit the latest developments in Face Analysis and Face Action Unit (FAU) detection [48]. The main contributions of this paper are: (1) a novel framework to exploit deep face architectures to model hypomimia in PD patients; (2) the comparison of PD detection accuracies based on single images vs. image sequences while the patients elicited various facial expressions; (3) the exploration of different domain adaptation techniques that exploit existing models, initially trained either for Face Analysis or to detect FAUs, for the automatic discrimination between PD patients and healthy subjects; and (4) a new approach to use triplet-loss learning to improve hypomimia modeling and PD detection.

Materials and methods
Let us assume that $w_{FA}$ is a model trained for Face Analysis tasks and the representation $x_{FA}$ is a feature vector generated by the model (typically from the last layers of a Convolutional Neural Network) from an input face image. This representation $x_{FA}$ is learned to describe the face image in a projected space where faces from the same person remain closer than faces from different persons. Similarly, models and representations can be trained for different tasks such as Action Unit recognition ($w_{AU}$) (e.g., in the form of facial gestures) or Parkinson's Disease detection ($w_{PD}$). Domain adaptation refers to methods that serve to adapt a representation $x_A$ trained for the domain A to a new domain B (typically a domain with similar characteristics to A but less information to train). The resulting representation $x_B$, adapted from $x_A$, is expected to perform better than a representation trained from scratch for the domain B.
We propose an experimental framework where Action Unit features are explored at different levels (or domains). The list of domains and the corresponding underlying hypotheses to be explored are presented below. (See also Fig 1).
Face Analysis Domain (Level 1).
We propose to use pre-trained Face Analysis models to extract face representations (namely $x_{FA}$) for Parkinson's detection.
• Experiment: we evaluate the performance of PD detection for different sequences of face gestures, including right eye wink, left eye wink, smile, anger and surprise, using pre-trained Face Analysis models ($w_{FA}$ trained with VGGFace2 [49]).

Action Unit Domain (Level 2).
We propose to improve the learned Face Analysis representations ($x_{FA}$) for Parkinson detection by incorporating an Action Unit domain adaptation ($w_{AU}$) training process:
• Hypothesis (H2): automatic detection of hypomimia is improved when features from the action unit domain are incorporated into the representations. The representation $x_{AU}$ performs better for Parkinson detection than the representation $x_{FA}$.
• Experiment: the pre-trained models ($w_{FA}$) are adapted to the Action Unit domain ($w_{AU}$) using the EmotioNet database [50]. The performance of both $x_{FA}$ and $x_{AU}$ is evaluated for Parkinson detection.

Parkinson Domain (Level 3).
We evaluate the performance obtained by representations $x_{PD}$ trained with healthy subjects and Parkinson's patients and the Triplet Loss function:
• Hypothesis (H3): similarity learning functions designed to enhance the Parkinson features can serve to improve the capability to detect hypomimia.
• Experiment: the Action Unit model ($w_{AU}$) is adapted to the Parkinson domain using the Triplet Loss function and the FacePark-GITA database (see Parkinson Domain: FacePark-GITA Section for details).
Details of the methods implemented to validate all hypotheses are presented in Methods.

Databases
Three different databases are considered in this work. The first two, VGGFace2 [49] and EmotioNet [50], are popular for Face Analysis and Face Action Unit detection, respectively. The third one is a new database composed of PD patients and healthy subjects. It contains face videos of patients suffering from Parkinson's disease and age-matched healthy controls. This new corpus is called FacePark-GITA. Details of each database are presented below.
Face analysis domain: VGGFace2. This database comprises more than 3.31 million faces from 9,131 different subjects. An average of 362.6 images per subject are included [49]. The images were downloaded from Google Image Search. The corpus has large variations in pose, age, lighting, ethnicity, and profession. This database is popular in the Face Recognition community and it has been extensively used to train competitive recognition models [51,52].
Action unit domain: EmotioNet. This database was originally introduced by researchers from the Ohio State University who released the EmotioNet Challenge in 2017 [50]. This database contains one million facial expression images collected from the Internet. A total of 950,000 images were annotated by the automatic Action Unit (AU) detection model presented in [50], and the remaining 50,000 images were manually annotated by experts. A total of 12 AUs are included in the corpus.
Parkinson domain: FacePark-GITA. The database was created by GITA Lab. The recording of patients is still ongoing and the most updated version of the corpus contains video recordings of 24 healthy participants and 30 PD patients. The videos were recorded at 15 frames per second in non-controlled environment conditions, i.e., light conditions and the background were not controlled prior to the recording and differ among participants. PD patients were diagnosed by an expert neurologist and were evaluated according to the MDS-UPDRS-III scale and the Hoehn and Yahr scale (H&Y) [53]. A summary of the clinical and demographic information is presented in Table 2.
The participants of this study were asked to elicit different facial expressions while being recorded. A total of five video-task recordings are included per participant: right eye wink, left eye wink, smile, anger, and surprise. The average duration of each video is 6 seconds. Patients have an average age of 69 years and healthy subjects were chosen with a similar age range. Possible biases introduced by gender or age were discarded via a chi-square statistical test (p = 0.44) and a Welch's t-test (p = 0.15), respectively.
Ethical approval. All of the signals considered in this work were collected in compliance with the Helsinki Declaration and the procedure was approved by the Ethics Committee (CBE-SIU) at the University of Antioquia in Medellín, Colombia (approval # 19-63-673 of April 25th, 2019). All participants signed a written informed consent before the recording. The individual in this manuscript has given written informed consent (as outlined in PLOS consent form) to publish these case details.

Methods
Image sequences extraction. Each video from the FacePark-GITA corpus corresponds to a different facial expression: smile, anger, surprise, left eye wink, or right eye wink. Five frames per video-task were extracted with the software Affectiva (available at https://www.affectiva.com/). The curve of valence provided by the software is used as the criterion to select the following sequence of five images/frames per participant on each expression: (i) neutral; (ii) the transition from neutral to the apex (i.e., onset); (iii) apex; (iv) the transition from the apex to neutral (i.e., offset); and (v) neutral. The total number of frames used is 1350 (5 frames/expression × 5 expressions/user × 54 users). The sequence of images and their direct relation with the valence curve are illustrated in Fig 2. Given the small amount of information provided by individual frames, and considering that extending the analyses to full video-frames would have increased the computational cost and complexity dramatically, we decided to consider multi-frame sequences in a simple information fusion architecture based on score fusion [54]. Notice that this approach allows us to capture changes during the production of facial expressions. The general idea was already studied in [55] for speech signals, where the author hypothesised that PD patients have more difficulty starting or stopping the movement of muscles and limbs during speech production. The idea was later extended to other motor skills like handwriting and gait [9,56].
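The selection of the five frames from a valence curve can be sketched as follows. This is only an illustration: the text does not state the exact selection rule, so the sketch assumes that the apex is the frame with the largest absolute valence and that onset and offset are the midpoints between the apex and the first and last frames. The function name `select_five_frames` and the synthetic curve are hypothetical.

```python
from typing import Dict, List


def select_five_frames(valence: List[float]) -> Dict[str, int]:
    """Pick frame indices for neutral, onset, apex, offset, neutral.

    Assumption (not from the paper): the apex is the frame with the largest
    absolute valence, and onset/offset are the midpoints between the apex and
    the first/last frame of the video.
    """
    if len(valence) < 5:
        raise ValueError("need at least five frames")
    last = len(valence) - 1
    apex = max(range(len(valence)), key=lambda i: abs(valence[i]))
    # Keep the apex away from the borders so that all five indices differ.
    apex = min(max(apex, 2), last - 2)
    return {
        "neutral_start": 0,
        "onset": apex // 2,
        "apex": apex,
        "offset": (apex + last) // 2,
        "neutral_end": last,
    }


# Synthetic valence curve: a 6 s video at 15 fps (90 frames) peaking at frame 40.
valence = [max(0.0, 100.0 - 2.5 * abs(i - 40)) for i in range(90)]
frames = select_five_frames(valence)
print(frames)

# Total number of frames in the corpus: 5 frames x 5 expressions x 54 users.
total_frames = 5 * 5 * 54
print(total_frames)  # 1350
```

Any other rule that yields one frame per stage (for instance, thresholds on the valence curve) would fit the same interface.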
As in the cases of speech, gait, and handwriting, we believe that the same hypothesis holds during the production of facial expressions. Thus, the analysis of multiple frames in facial expressions should provide useful information to discriminate between PD patients and healthy subjects. The aforementioned idea is implemented by the extraction of the following multi-frame sequences:
• NOnA: Neutral, Onset, and Apex.
• NOnAOffN: Neutral, Onset, Apex, Offset, and Neutral.
Face analysis pre-trained model. In this work, we employ the ResNet50 architecture [52], with 50 layers and 25.6M parameters. The architecture adds skip connections to allow gradients to be passed back smoothly to early layers. This model is used to generate an initial face representation. The ResNet50 model was originally proposed for general image recognition tasks and later it was retrained with the VGGFace2 database [49] for face recognition. This architecture has been extensively used as a starting point in facial expression analysis [57][58][59] and Action Unit recognition in competitions like Affective Behavior Analysis in-the-wild (ABAW) in FG 2020 [60], ICCV 2021 [61], and CVPR 2022 [62]. The architecture is used as a feature extractor by removing the final decision layer. For each face image, the model generates a 1 × 2048 feature vector.
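The score-level fusion of a multi-frame sequence mentioned earlier can be illustrated with a minimal sketch. It assumes that each frame has already been mapped to a 1 × 2048 feature vector and scored by a classifier, and that fusion is a plain average of the per-frame scores; the text only states that score fusion is used, so the averaging rule, the name `fuse_scores`, and the example scores are assumptions.

```python
from statistics import mean
from typing import Dict, List

# Frame labels of the two multi-frame sequences named in the text.
SEQUENCES: Dict[str, List[str]] = {
    "NOnA": ["neutral_start", "onset", "apex"],
    "NOnAOffN": ["neutral_start", "onset", "apex", "offset", "neutral_end"],
}


def fuse_scores(frame_scores: Dict[str, float], sequence: str) -> float:
    """Average the per-frame classifier scores of one sequence.

    Assumption: fusion is a plain mean of per-frame scores; the paper only
    states that score-level fusion is used.
    """
    return mean(frame_scores[name] for name in SEQUENCES[sequence])


# Hypothetical per-frame scores for one participant and one expression.
scores = {
    "neutral_start": 0.2,
    "onset": 0.8,
    "apex": 0.5,
    "offset": 0.4,
    "neutral_end": 0.1,
}
fused_nona = fuse_scores(scores, "NOnA")
fused_full = fuse_scores(scores, "NOnAOffN")
print(fused_nona, fused_full)
```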
In our experiments we apply Transfer Learning (TL) [63] to adapt from one domain to another (e.g., from Face Analysis to the Action Unit domain). TL refers to methods where weights from a model originally learned for one task are used as initialization before adjusting the model for a different task. One transfer learning technique consists in freezing the initial and intermediate layers, to retain their capability to extract general characteristics, and retraining the last layers, which are closer to the network output. Retraining those last layers allows the original feature space to be adapted to the new task. These methods are suitable for problems where data is scarce and end-to-end learning approaches fail to find the optimal feature space. The number and size of available databases to model hypomimia in patients suffering from PD are very small (typically fewer than 100 subjects and fewer than 1,000 images in total), so we expect that TL techniques will be very useful here to adapt to the Parkinson domain from the Face Analysis domain, where massive datasets are available for learning (millions of images).
Face action unit detection models. In addition to the Face Analysis model, in this work we employ two deep neural networks trained from scratch for Face Action Unit (FAU) detection. The architectures employed are based on the popular VGG and ResNet models [48,64]. The details of the two models are described below:
VGG-8: This model contains 8 convolutional layers divided into groups of 2 layers. Each group is followed by a Max-Pooling layer. Convolutional layers apply a variety of filters to the images and Max-Pooling layers reduce the size of the filtered images. Additionally, dropout is used in the regularization layers to randomly discard neurons in the model and make it less prone to overfitting. The final part of the architecture has a total of six convolutional layers (fully-connected) before the decision layer. The number of neurons per layer is 1024, 512, 256, 128, 64, and 32. The number of parameters of this model is 295,448.
ResNet-7: The ResNet model is composed of a total of 7 residual blocks. Each block can be defined as an identity-block or a conv-block. The identity-blocks are the standard blocks used in ResNet: they have a set of convolutional filters and a shortcut connection which bypasses these filters. This block has the same input and output dimensions. Conv-blocks are the block types where the input and output dimensions do not match. The difference with respect to the identity-block is a convolutional layer in the shortcut to the output. Traditional architectures with a high number of layers suffer from the problem of error degradation during training. ResNet models, with their shortcut connections to previous layers, are effective in solving this problem [52]. The number of parameters of this model is 366,626.
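The two block types can be summarized with the usual residual formulation. The notation below is an assumption taken from the standard ResNet description, not from this text: $x$ and $y$ are the block input and output, $\mathcal{F}$ is the stack of convolutional filters with weights $\{W_i\}$, and $W_s$ is the convolution placed in the shortcut.

```latex
% Identity-block: input and output dimensions match, so the shortcut is the identity.
y = \mathcal{F}(x, \{W_i\}) + x

% Conv-block: dimensions differ, so the shortcut applies a convolution W_s.
y = \mathcal{F}(x, \{W_i\}) + W_s\, x
```

In both cases the block only has to learn the residual $\mathcal{F}$ on top of the shortcut path, which is what mitigates the degradation problem in deep stacks.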
Triplet loss for facial expression analysis. Due to the limited number of samples in the FacePark-GITA database, for the Parkinson domain adaptation we opted for a Triplet Loss learning approach. The Triplet Loss function consists in applying a linear transformation over the data before taking the distance among samples. Given a training data set $S = \{(x_i, y_i)\}$ with inputs $x_i \in \mathbb{R}^d$ and discrete class labels $y_i \in \mathbb{Z}$, the goal is to find a transformation of the input data that reduces the distance between pairs from the same class while increasing the distance between pairs from different classes. The Mahalanobis distance defined in Eq 1 is the similarity measure used in this work.

$$d_M(x_i, x_j) = \sqrt{(x_i - x_j)^\top M \,(x_i - x_j)} \tag{1}$$

where $M$ is a positive semi-definite symmetric matrix that can be decomposed as $M = T^\top T$, where $T$ denotes a linear transformation matrix. Eq 1 can be rewritten as:

$$d_M(x_i, x_j) = \sqrt{(x_i - x_j)^\top T^\top T \,(x_i - x_j)} \tag{2}$$

$$d_M(x_i, x_j) = \lVert T x_i - T x_j \rVert_2 \tag{3}$$

The linear transformation $T$ can be generalized as $F(x_i)$, where $F$ indicates a kernel function. The resulting distance metric is as follows:

$$d_F(x_i, x_j) = \lVert F(x_i) - F(x_j) \rVert_2 \tag{4}$$

The process to determine the transformed vector $F(x)$ requires finding a transformation that makes the intra-class distance smaller than the inter-class distance. The general rule applied over the data set consists in the following triplets $S_T$:

$$S_T = \{(x_a, y_a), (x_n, y_n), (x_p, y_p) \mid y_a = y_p,\; y_a \neq y_n\} \tag{5}$$

where $a$, $p$ are samples belonging to the same class, and $n$ is a sample from a different class. In our Parkinson detection experiments, the number of classes is two (Healthy and Parkinson). However, we propose to introduce an additional restriction in the triplet. In our experiments, $a$ and $p$ belong to the same class, but present different facial expressions. The generation of the triplets $S_T$ can be seen as a data augmentation technique. The high number of possible combinations of three elements in a dataset enriches the training process, especially when a low number of samples is available. The triplet loss function to be minimized is defined as:

$$\mathcal{L} = \sum_{S_T} \left[ d_F(x_a, x_p) - d_F(x_a, x_n) + \alpha \right]_+ \tag{6}$$

where $[z]_+ = \max(z, 0)$, and $\alpha \geq 0$ is the minimum margin required between classes.
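A minimal sketch of the triplet construction with the additional expression restriction, and of a hinge-type loss with margin α, is given below. It is an illustration under stated assumptions: the embeddings are taken as already transformed vectors F(x), the distance is Euclidean, and the names `build_triplets`, `triplet_loss`, and the toy samples are hypothetical.

```python
from itertools import product
from math import dist
from typing import List, Sequence, Tuple

# (embedding, class label, expression); the values used below are hypothetical.
Sample = Tuple[Sequence[float], str, str]


def build_triplets(samples: List[Sample]) -> List[Tuple[int, int, int]]:
    """Index triplets (a, p, n): a and p share the class but differ in
    expression, while n belongs to the other class."""
    triplets = []
    for a, p, n in product(range(len(samples)), repeat=3):
        _, ya, ea = samples[a]
        _, yp, ep = samples[p]
        _, yn, _ = samples[n]
        if a != p and ya == yp and ea != ep and ya != yn:
            triplets.append((a, p, n))
    return triplets


def triplet_loss(samples: List[Sample],
                 triplets: List[Tuple[int, int, int]],
                 alpha: float = 0.2) -> float:
    """Sum of hinge terms max(d(a, p) - d(a, n) + alpha, 0) over all triplets.

    Assumption: Euclidean distance on already-embedded vectors F(x).
    """
    total = 0.0
    for a, p, n in triplets:
        d_ap = dist(samples[a][0], samples[p][0])
        d_an = dist(samples[a][0], samples[n][0])
        total += max(d_ap - d_an + alpha, 0.0)
    return total


toy = [
    ([0.0, 0.0], "PD", "smile"),
    ([0.0, 1.0], "PD", "anger"),
    ([3.0, 0.0], "HC", "smile"),
    ([3.0, 1.0], "HC", "anger"),
]
toy_triplets = build_triplets(toy)
print(len(toy_triplets))  # 8 triplets from only 4 samples
print(triplet_loss(toy, toy_triplets, alpha=0.2))
```

Even this toy set of four samples yields eight valid triplets, which illustrates why triplet generation acts as a form of data augmentation when few samples are available.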

Classification and parameter optimization
The automatic classification between healthy people and PD patients is performed using Support Vector Machines (SVMs). The classification of patients with different degrees of impairment is performed using SVMs optimized in a one vs. all strategy. In the binary classification experiments with SVMs, linear and Gaussian kernels are considered. The optimization of hyper-parameters is performed with a grid search over powers of ten, with $C \in \{10^{-4}, 10^{-3}, 10^{-2}, \ldots, 10^{2}, 10^{3}\}$ and $\gamma \in \{10^{-4}, 10^{-3}, 10^{-2}, \ldots, 10^{3}\}$ for the Gaussian kernel; for the linear kernel, the same grid is used for the $C$ parameter. All models presented in this work are optimized following a nested 5-fold subject-independent cross-validation strategy and a data augmentation technique with random rotations between -10 and +10 degrees. Each fold has 864, 216, and 270 samples for training, validation, and testing, respectively. Classification results are reported in terms of accuracy (Acc), sensitivity (Sens), specificity (Spec), F1-Score (F1), and Area Under the receiver operating characteristic Curve (AUC). In all cases, results include the values of the optimal hyper-parameters, which are found as the mode of the parameters selected across the test folds of each experiment.
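The fold sizes and the hyper-parameter grid can be checked with a short sketch. The 80/20 split of the non-test samples into training and validation is an assumption that happens to reproduce the reported 864/216/270 figures; the helper names and the example list of selected values are hypothetical.

```python
from collections import Counter

TOTAL_SAMPLES = 5 * 5 * 54  # frames x expressions x participants
N_FOLDS = 5

# Search grid: powers of ten from 1e-4 to 1e3 for both C and gamma.
GRID = [10.0 ** e for e in range(-4, 4)]


def fold_sizes(total: int, n_folds: int, val_fraction: float = 0.2):
    """Return (train, validation, test) sizes for one outer fold.

    Assumption: 20% of the non-test samples are held out for validation,
    which reproduces the 864/216/270 split reported in the text.
    """
    test = total // n_folds
    remaining = total - test
    val = round(remaining * val_fraction)
    return remaining - val, val, test


def mode_of(selected):
    """Most frequent hyper-parameter value across the test folds."""
    return Counter(selected).most_common(1)[0][0]


sizes = fold_sizes(TOTAL_SAMPLES, N_FOLDS)
print(sizes)  # (864, 216, 270)

# Hypothetical values of C selected in the five outer folds.
best_c = mode_of([1.0, 10.0, 1.0, 0.1, 1.0])
print(best_c)  # 1.0
```

Note that a subject-independent split must assign all frames of one participant to the same partition; the sketch only reproduces the sample counts, not the assignment of subjects.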

Experiment 1: Face analysis domain
PD detection based on single face images (baseline). Individual frames corresponding to each valence level shown in Fig 2 are considered as the baseline to evaluate whether specific frames provide relevant information to discriminate between PD patients and healthy subjects. Feature vectors are obtained from the last layer of the ResNet50 model (see Section Face Analysis pre-trained model). Table 3 summarizes the results.
Note that there is almost no difference among the accuracies obtained with the frames of each expression stage. Perhaps the only thing to highlight is the high sensitivity (88.6%) of the Onset stage, which suggests that this stage may be a good choice to model hypomimia in specific frames within a video. This preliminary observation will be further elaborated in the next experiments.
PD detection based on image sequences. The three image sequences introduced in Image sequences extraction are used here to discriminate between PD patients and Healthy Control (HC) subjects. Table 4 shows the results obtained when the changes in the production of facial expressions are incorporated by feature vectors extracted from multi-frame sequences.
The results obtained by the multi-frame sequences are better than those obtained with individual frames. The improvement is around 7% and the best result is obtained in the two cases where the sequence NOnA is included, which is focused on modeling information in the transition between neutral and the production of a certain expression. It is also worth highlighting that sensitivity is near 90% in all of the cases, while specificity is rather low (around 64%). This indicates that the proposed approach is good at detecting patients but not as good at detecting healthy controls. This result validates the hypothesis H1 about the existence of useful information related to hypomimia in the elicited facial expressions. Given this clear improvement, the next experiments will include only feature vectors extracted from multi-frame sequences.
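The relation between a high sensitivity and a low specificity can be made concrete with a small worked example. The confusion counts below are hypothetical and chosen only to illustrate the computation (they are not taken from the tables of this work); PD is treated as the positive class.

```python
def detection_metrics(tp: int, fn: int, tn: int, fp: int):
    """Return (sensitivity, specificity, accuracy) from confusion counts,
    with PD as the positive class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts for illustration only (not taken from the paper's tables):
# 150 PD samples and 120 healthy-control samples in a test partition.
sens, spec, acc = detection_metrics(tp=135, fn=15, tn=77, fp=43)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

With such counts most PD samples are detected, while a sizeable share of healthy controls is misclassified as PD, which is the pattern described for the multi-frame sequences.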

Experiment 2: Action unit domain
This experiment intends to incorporate information from the Action Unit domain to improve Parkinson's Disease (PD) detection. In this case the EmotioNet database is used to create an appropriate facial representation space. The first step consists in selecting AUs that provide suitable information to perform the automatic classification between PD patients and healthy subjects. Following [32], we selected a subset of AUs suited to the facial expressions included in the recording tasks of the FacePark-GITA database. We included AUs 1, 2, 4, 5, 6, 12, 25, and 26 from the EmotioNet dataset, motivated by the fact that AUs 4, 5, 25, and 26 are related to the anger expression (negative expression), AUs 6, 12, and 25 are related to the smile expression (positive expression), and AUs 1, 2, 5, 25, and 26 are related to surprise (other expression) [65]. Fig 3 shows the set of selected AUs.
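The selected subset is simply the union of the AUs associated with the three expressions, as the following sketch shows. The multi-hot label encoding and the helper `to_multi_hot` are illustrative assumptions about how the eight AUs could be presented to a classifier.

```python
# Action units associated with each elicited expression, as listed in the text.
EXPRESSION_AUS = {
    "anger": {4, 5, 25, 26},
    "smile": {6, 12, 25},
    "surprise": {1, 2, 5, 25, 26},
}

# The retained subset is the union over the three expressions.
SELECTED_AUS = sorted(set().union(*EXPRESSION_AUS.values()))
print(SELECTED_AUS)  # [1, 2, 4, 5, 6, 12, 25, 26]


def to_multi_hot(active_aus):
    """Encode a set of active AUs as a 0/1 vector over SELECTED_AUS
    (hypothetical label format for a multi-label AU classifier)."""
    return [1 if au in active_aus else 0 for au in SELECTED_AUS]


smile_label = to_multi_hot(EXPRESSION_AUS["smile"])
print(smile_label)  # [0, 0, 0, 0, 1, 1, 1, 0]
```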
Adaptation from face analysis models. The process to adapt the convolutional models from one domain to another consists in freezing different percentages of the layers and retraining the remaining portion. The data with the selected AUs from the EmotioNet dataset are used here to retrain the models. In this case we evaluate three percentages of layers frozen during the retraining of the ResNet50 model (originally trained for Face Analysis): freezing 50% (Freeze 50, 20.5M trainable parameters), freezing 75% (Freeze 75, 16.0M trainable parameters), and freezing 100%. Note that the model with 100% of the layers frozen is taken as the Baseline and corresponds to the case where no action unit information is incorporated (x_FA). After the convolutional layers, a fully connected layer is added for the classification of the 8 selected AUs (see Fig 3). The result of the retraining process and its performance in classifying the AUs is shown in Table 5 in terms of AUC and EER values. The accuracy varies depending on the FAU and the percentage of layers frozen. FAUs 6, 12, and 25 reached accuracies around 90%, while the rest of the FAUs achieved performances around 80%.
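The freezing strategy described above can be sketched without any deep learning framework. The snippet below is only an illustration under stated assumptions: the layer names and parameter counts are hypothetical (they are not the real ResNet50 layout), and it assumes that the percentage refers to the number of layers counted from the input side, which the text does not specify.

```python
def split_frozen_trainable(layers, freeze_fraction):
    """Freeze the first `freeze_fraction` of the layers (those closest to the input)."""
    if not 0.0 <= freeze_fraction <= 1.0:
        raise ValueError("freeze_fraction must be in [0, 1]")
    n_frozen = int(len(layers) * freeze_fraction)
    return layers[:n_frozen], layers[n_frozen:]


def count_parameters(layers):
    """Sum the parameter counts of (name, n_params) pairs."""
    return sum(n_params for _, n_params in layers)


# Hypothetical (name, parameter count) pairs, ordered from input to output.
toy_layers = [("conv1", 100), ("block1", 400), ("block2", 1600), ("block3", 6400)]

trainable_by_fraction = {}
for fraction in (0.5, 0.75, 1.0):
    frozen, trainable = split_frozen_trainable(toy_layers, fraction)
    trainable_by_fraction[fraction] = count_parameters(trainable)
    print(f"freeze {fraction:.0%}: {trainable_by_fraction[fraction]} trainable parameters")
```

Freezing 100% of the layers leaves no trainable parameters in the convolutional part, which matches the role of the Baseline as a model that receives no action unit information.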
The representations x_AU obtained by the retrained models are further used to classify between PD patients and healthy subjects of the FacePark-GITA corpus. The results obtained with the Freeze 75 model are reported in Table 6. As in Table 4, the optimal hyperparameters found in the 5-fold cross-validation process are included in every experiment.
Note that the Freeze 75 exhibits higher accuracies than the Freeze 50, indicating that considerable information from the Face Analysis domain is still useful to obtain good results in the classification between PD patients and healthy subjects. More interestingly, note that the best accuracy obtained with the Freeze 75 model in Table 6 (87.3%) is 8.9% higher than the best result obtained when only a Face Analysis model is considered (Table 4). This result supports our second hypothesis (H2), the idea of incorporating information from the Action Unit domain to the Face Analysis domain to improve detection of hypomimia in PD patients. The benefits of including information of the Action Unit domain are also shown in Fig 4, where the ROC curves obtained with the Freeze 75, Freeze 50, and Baseline models are presented.
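Since the models are compared through ROC curves, AUC, and EER values, a minimal sketch of how the two summary figures can be obtained from classifier scores may be useful. The scores below are made up for illustration, and the threshold sweep used for the EER is one simple approximation, not necessarily the procedure used in the experiments.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a positive scores higher than a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def equal_error_rate(pos_scores, neg_scores):
    """Approximate EER: error rate at the threshold where FPR and FNR are closest."""
    best_gap, eer = None, None
    for thr in sorted(set(pos_scores) | set(neg_scores)):
        # A sample is predicted positive when its score is >= thr.
        fnr = sum(1 for p in pos_scores if p < thr) / len(pos_scores)
        fpr = sum(1 for n in neg_scores if n >= thr) / len(neg_scores)
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


# Toy scores; illustrative only, not outputs of the models in the paper.
pd_scores = [0.9, 0.8, 0.7, 0.4]
hc_scores = [0.6, 0.3, 0.2, 0.1]

auc = roc_auc(pd_scores, hc_scores)
eer = equal_error_rate(pd_scores, hc_scores)
print(f"AUC={auc:.4f} EER={eer:.2f}")
```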

PLOS ONE
Note that the models used up to this point of the study are based on an architecture originally trained for Face Analysis tasks (ResNet50). We now evaluate the importance of this initialization based on a Face Analysis training process.
Training action unit models from scratch. The previous scenario studied the performance of pre-trained models with a high number of parameters learned from the Face Analysis domain after adaptation to the Action Unit domain. In this section we train FAU detection models from scratch. ResNet50 requires optimizing more than 20M parameters. Conversely, the VGG-8 and ResNet-7 architectures proposed in the Face Action Unit detection models section require the optimization of 295,448 and 366,626 parameters, respectively. These reduced architectures are trained with the same data as those considered previously to retrain the Freeze 50 and Freeze 75 models. Table 8 shows the AUC values obtained when the different AUs are detected. Note that these results are higher than those reported in Table 5, where a greater number of parameters is optimized. However, ResNet50 was originally trained for face recognition tasks, where face gestures are features to be excluded from the representation space. This result indicates that a simpler model might provide high enough AU discrimination performance to be used in the classification between PD patients and healthy controls. Tables 9 and 10 show the results obtained when the aforementioned models, created with the reduced architectures, are used to discriminate between PD patients and healthy subjects. Note that no additional training is performed with data from Parkinson's disease patients. The best results are obtained when the ResNet-7 architecture is considered with features extracted from the NOnAOffN sequence. Although 78.8% could be considered a good accuracy, it is still far from the best result obtained with the ResNet50 Freeze 75 model (87.3% in Table 6), indicating that the FAU domain is missing certain features present in the Face Analysis domain.

Experiment 3: Parkinson's domain (PD detection)
In this section, different strategies are explored to evaluate their suitability to model specific patterns that appear on the face of PD patients. First, a simple model based on CNNs is trained from scratch; then, the use of a triplet loss function is explored to evaluate whether the classification of PD patients vs. Healthy Control (HC) subjects improves. The triplet loss function modifies the original representation space such that the separation between classes is increased while the dispersion within each class is reduced. The modified feature vectors are called embedded vectors.
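The effect of the triplet loss on the representation space can be illustrated with a minimal sketch. It assumes the commonly used formulation based on squared Euclidean distances and a margin; the exact variant, margin value, and triplet selection strategy used in the experiments are not stated in this section, so the vectors and the margin below are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero when the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


# Toy 2D embeddings: anchor and positive share a class, the negative does not.
anchor = (0.0, 0.0)
positive = (1.0, 0.0)

far_negative = (3.0, 0.0)    # already well separated from the anchor
near_negative = (0.5, 0.0)   # closer to the anchor than the positive

loss_far = triplet_loss(anchor, positive, far_negative)
loss_near = triplet_loss(anchor, positive, near_negative)
print(loss_far, loss_near)
```

Only the second triplet produces a non-zero loss, so training would push that negative sample away from the anchor and pull the positive sample closer.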
Training Parkinson detection models from scratch. The previous section showed the benefits of the Action Unit domain adaptation. This experiment is performed considering the NOnAOffN sequence from the FacePark-GITA database as training and test data. The use of this sequence is motivated by the need to maximize the amount of data in the training process. The models are trained directly, i.e., from scratch, with randomly initialized weights and cross-entropy as the loss function. Table 11 shows the performance of the models created from scratch and trained with the NOnAOffN sequences of the FacePark-GITA database; these results can also be considered a baseline regarding the use of deep learning-based architectures. The results show that the accuracy of the VGG-8 with randomly initialized weights is comparable to that obtained when the adapted Action Unit model is considered (approximately 67.7% in both cases). However, it can also be observed that these results have high levels of variance compared to the results in Table 9 (i.e., 32.8% and 7.4% sensitivity variance for the PD domain and the AU domain, respectively). When comparing the results in Tables 10 and 11, it can be observed that ResNet-7 trained from scratch has higher variance and lower accuracy than the adapted Action Unit model (71.7% and 78.8%, respectively). This is likely due to the lack of enough data to appropriately train the model, which highlights the convenience of applying transfer learning (TL) techniques.
Triplet loss in face analysis models adapted to the action unit domain. The Freeze 75 and Freeze 50 models are trained with the triplet loss function strategy and two new models are obtained, namely Triplet 75 and Triplet 50, respectively. The FacePark-GITA database is divided into a 5-fold partition for the training of each Triplet model and the SVM classifier. The classification results obtained when using the embedded vectors are shown in Table 12 for the Triplet 75 model, and in Table 13 for the Triplet 50 model.
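A 5-fold partition such as the one mentioned above can be sketched as follows. This is only an illustration: the participant identifiers are hypothetical, and it assumes a partition that is stratified by class and made at the participant level (so that no participant appears in both the training and test folds), which this section does not state explicitly.

```python
import random


def stratified_folds(label_by_subject, k=5, seed=0):
    """Distribute subjects into k folds while keeping the class balance in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(label_by_subject.values())):
        subjects = sorted(s for s, l in label_by_subject.items() if l == label)
        rng.shuffle(subjects)
        # Deal the subjects of this class to the folds in round-robin order.
        for i, subject in enumerate(subjects):
            folds[i % k].append(subject)
    return folds


# Hypothetical participant identifiers: 10 PD patients and 10 healthy controls.
label_by_subject = {f"PD{i:02d}": "PD" for i in range(10)}
label_by_subject.update({f"HC{i:02d}": "HC" for i in range(10)})

folds = stratified_folds(label_by_subject)
for i, test_subjects in enumerate(folds):
    train_subjects = [s for j, fold in enumerate(folds) if j != i for s in fold]
    # The embedding model and the SVM would be fitted on train_subjects only.
    print(f"fold {i}: {len(train_subjects)} train / {len(test_subjects)} test subjects")
```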
Note that the Triplet 75 model exhibits better accuracy (86.0%) than the Triplet 50 model (80.7%). Since the best accuracies in the previous experiments with the Freeze 75 and Freeze 50 models were 87.3% and 83.1%, these new results obtained with the triplet loss strategy likely indicate that the embedding approach does not provide advantages over the use of transfer learning and freezing of layers. This observation is also supported by the fact that the number of parameters to be optimized has not been reduced, so in principle there is no reason for using the triplet loss function in these two scenarios.
Triplet loss in FAU detection trained from scratch. In this experiment the VGG-8 and ResNet-7 models are retrained considering the triplet loss function, creating two new models. Note that there is an improvement in both models compared to those based on VGG-8 and ResNet-7 where the triplet loss function was not applied. For VGG-8 the improvement is around 5.1% (from 67.6% to 72.7%) and for ResNet-7 it is around 3.6% (from 78.8% to 82.4%). These results partially validate our third hypothesis (H3), indicating that loss functions designed to learn from the PD domain serve to improve the performance of PD classification. It is interesting to highlight not only the improvement achieved when using the triplet loss function, but also that the best result obtained with the Triplet-ResNet7 model is competitive with the best accuracy previously obtained with the Freeze 75 model. Although the accuracy of Freeze 75 is 4.9% above that of Triplet-ResNet7, Freeze 75 requires 17,815,520 more parameters to be optimized than Triplet-ResNet7, which might indicate a better generalization capability of the smaller model. Further experiments with additional data are required to validate this hypothesis.
PCA is now used to create a 2D representation of the feature spaces learned in previous experiments. Fig 6 shows the feature spaces and the distribution of the classification scores. The figure shows a superior discrimination capability of the x_AU feature space (ResNet50 adapted to the FAU domain). The representation obtained by the Triplet-ResNet7 model shows a larger margin between classes but the misclassification errors decrease the performance.
Finally, we performed 25 random nested cross-validations with the hyper-parameters found previously to generate accuracy sets for the Kruskal-Wallis test between the three

Discussion and conclusion
This study presents a novel approach where deep learning methods are used to model hypomimia in PD patients. Videos of the faces of participants while eliciting facial expressions are considered for the study. Frames of the recorded videos are segmented into different stages during the production of elicited expressions: neutral, onset-transition, apex, offset-transition, and neutral. This approach exhibits improvements of up to 5.5% in accuracy (from 72.9% to 78.4%) with respect to classical approaches where single frames are considered. These results suggest that dynamic information is more suitable to model hypomimia in PD patients. We are aware that the presented approach does not completely exploit the video dynamics; however, the incorporation of frames from different stages during the production of facial expressions proves to be a good and computationally affordable approach. Later, information from the Action Unit domain is incorporated in the model by means of transfer learning methods. Transfer learning was performed considering the complete architecture of a base model previously trained with massive data and then freezing some layers to fine-tune the remaining layers with the smaller action unit data. Results freezing 75% and 50% of the layers are reported. The results show that the Action Unit domain adaptation provides an improvement of 8.9% in the accuracy of PD detection, from 78.4% to 87.3%. These results confirm that domain adaptation via transfer learning methods is a good strategy to model hypomimia in PD patients. Considering the good results and also the fact that only up to five images per participant are considered in the experiments, we believe that this study is a step forward in the development of automatic methods for the detection and monitoring of PD symptoms related to the production of facial expressions.
With the aim of finding lighter approaches suitable to be used in portable devices, other experiments with reduced architectures like VGG-8 and ResNet-7 were also addressed. However, the results were not satisfactory, i.e., the maximal accuracies in these cases were 67.6% and 78.8%, respectively. The results were further improved up to 72.7% and 82.4% when the triplet loss strategy was considered.
Each row in Fig 7 shows the changes in the activation maps across three columns. The first column (FA Domain) corresponds to the classical Face Analysis domain and focuses broadly on the faces of the participants. The second column (AU Domain) shows concentration in more specific regions of the face, which are highly related to the facial action units. Notice that the regions activated in the AU Domain of Fig 7 (second column) are related to the right wink task, while the AU Domain images in Fig 7 show more intense regions over the lips, which are closely related to the smile task. Finally, the third column corresponds to the Parkinson's domain when the triplet loss function is applied. Notice that in this case the concentration in the upper face area is intensified, indicating that it is the area that provides better discriminative capability to detect PD.
The computer vision approach applied to model the effects of hypomimia presents limitations. More research is needed before these approaches have a direct impact on patients' lives. The study of connections and patterns between emotions, facial expressions, and hypomimia symptoms will help to improve computer vision approaches. In this respect, the lack of large datasets acquired by multidisciplinary teams including PD experts and machine learning experts is a major handicap for the advancement of the research community.
The availability of larger corpora will make it possible to study more sophisticated machine learning architectures such as MobileNets, ShuffleNet, multiresolution ensemble structures, or other technologies to integrate information provided by video sequences, including video tracking of facial features and other modalities [54], like speech, gait, handwriting [11], and human-computer interaction signals [66].